Add Intel Advanced Matrix Extensions (AMX) support to ggml #7707
base: master
Conversation
This PR also adds OpenMP support, since the original pthread sync is done via atomics, which have very high overhead on server CPUs (and the sync has to be done very frequently, for each operator launch). This was not my initial target, but I had to fix it by using another threading runtime, OpenMP or TBB; otherwise the performance speedup would be cut down quite a bit. I noticed #7606 is also doing this, which should also work.
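For illustration, here is a minimal sketch (the function names are placeholders of mine, not the PR's code) of why the OpenMP path is cheaper: the runtime keeps a warm thread pool and supplies the join barrier itself, so there is no per-operator atomic spin on the calling thread.

```cpp
#include <cstdio>
#include <omp.h>

// Hypothetical per-thread worker standing in for one ggml operator's slice.
static void compute_chunk(int ith, int nth) {
    std::printf("thread %d of %d\n", ith, nth);
}

int main() {
    // OpenMP reuses its thread pool across operator launches and handles the
    // join barrier, avoiding a hand-rolled atomic spin/wake per operator.
    #pragma omp parallel num_threads(4)
    {
        compute_chunk(omp_get_thread_num(), omp_get_num_threads());
    } // implicit barrier here
    return 0;
}
```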
BTW, why does AMX greatly improve next-token latency?
Here is my suggestion:
I also wrote a VNNI kernel for the gemv cases.
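For context, the core building block of such a VNNI gemv kernel is the dpbusd dot-product instruction. A minimal sketch (not the PR's actual code; the compensation needed because q8_0 activations are signed is omitted here):

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch of an AVX512-VNNI dot product over 64 bytes:
// _mm512_dpbusd_epi32 multiplies unsigned 8-bit values in `a` with signed
// 8-bit values in `b` and accumulates groups of 4 products into 32-bit lanes.
// Compile with -mavx512f -mavx512vnni.
static int32_t dot_u8s8_64(const uint8_t *a, const int8_t *b) {
    __m512i va  = _mm512_loadu_si512(a);           // 64 unsigned 8-bit values
    __m512i vb  = _mm512_loadu_si512(b);           // 64 signed 8-bit values
    __m512i acc = _mm512_dpbusd_epi32(_mm512_setzero_si512(), va, vb);
    return _mm512_reduce_add_epi32(acc);           // horizontal sum of 16 lanes
}
```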
Sure, the BKMs for Intel also need to be updated.
Updates: f16 support added. Right now this patch only has an avx512 kernel which does fma with … Also postponing bf16 AMX kernel support to align with the f16 AMX timeline, since bf16 is not that common in gguf. Performance: tested on …
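As a rough illustration of what such an avx512 f16 path can look like (an assumption on my part, not necessarily the patch's actual kernel): up-convert f16 to f32 and accumulate with FMA, so neither amx-fp16 nor avx512-fp16 is required.

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch: f16 dot product using AVX-512F only. The f16 inputs are stored as
// raw uint16_t bit patterns, converted to f32 with vcvtph2ps, then accumulated
// with FMA. n is assumed to be a multiple of 16.
static float dot_f16_avx512(const uint16_t *a, const uint16_t *b, int n) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        __m512 va = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)(a + i)));
        __m512 vb = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)(b + i)));
        acc = _mm512_fmadd_ps(va, vb, acc);  // acc += va * vb
    }
    return _mm512_reduce_add_ps(acc);
}
```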
I experimented with this in f3974ca by moving all matrix multiplication to the BLAS backend. Generally I think the performance is ok, maybe 1-2% slower for small models (<1B), but the difference is very small for most models.
I think the implementation is good! The only thing missing is a guide for this feature. Thank you!
I think it would be better to leave the implementation as is instead of moving it to a different backend; the performance would be slightly better, and I don't really see a good reason to split the CPU backend into multiple backends. The changes in …
AMX is a new built-in accelerator available from the 4th generation of Xeon, the Intel server CPU (link). So this PR is actually trying to improve the performance of llama.cpp on Intel server CPUs. And AMX is not the same concept as BLAS. @slaren I don't quite get your idea, should I continue with …? The OMP changes in ggml.c will be gone after rebasing. Currently I am working on the QK_K AMX kernels and I will clean up this PR once they are done.
I was responding to @ggerganov's suggestion to move the implementation to a different backend, similar to the BLAS backend. I think you should continue as is.
Ok, let's proceed as is.
Is there any progress?
Recently I got distracted by some other tasks; I use my spare time to work on this project as it is not an official task from my employer. Currently I am working on the Q4_K quant format, and I have to say that it is much more complex... Anyway, it's about to be finished.
Added AMX and VNNI kernels for …
I like that the code is very well isolated from the rest of the codebase. I haven't reviewed the mmq.cpp source in detail yet, and it would be difficult without the appropriate hardware, but I think that's alright as we can easily determine that there won't be side effects on the rest of the codebase.
Wondering how we could add some tests for this functionality.
Overall, seems good to me. @slaren What do you think?
Yeah... the CI is a big problem. I will try to find some internal sponsor and then we can use our company cloud, that would be the best. Otherwise, we will have to go with an emulator. @ggerganov I was wondering how the …
We don't have CI for the CANN backend either. For aarch64, I'm planning to try to rent an Arm machine on the Azure cloud when they become available and if they are not too expensive.
I think the Sapphire Rapids Xeon (4th generation Xeon) supports AMX.
Hi, I noticed some quantization issues in …
The weight packing for Q8_0 is here: https://github.com/mingfeima/llama.cpp/blob/74bb1eb52be7d9b9eb484d156d24a474dd09f278/ggml/src/ggml-amx/mmq.cpp#L866-L873. Each weight block of 16x32 (NxK) is stored in the format (KxN) so that we can do FMA here, and each block has 16 scales (d0), packed as a contiguous 1x16 vector with dtype f16. So to sum up, the scale is a 256-bit vector which corresponds to 16 columns; it is not a "single scale parameter for all 16x32 weights". If the computation were wrong, the LLM would talk like crazy.
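To make the layout described above concrete, here is a small sketch (the identifiers are mine, not the actual mmq.cpp names) of one packed tile and the transpose/packing loop it implies:

```cpp
#include <cstdint>

// Layout sketch of one packed q8_0 weight tile as described in the comment:
// a 16x32 (N x K) block is stored transposed as K x N so the kernel can walk
// it contiguously, and the 16 per-column f16 scales form one contiguous
// 256-bit vector. (The real code may further interleave k in groups of 4 for
// VNNI/AMX; that detail is omitted here.)
struct packed_tile_q8_0 {
    uint16_t d[16];        // 16 f16 scales, one per output column (256 bits total)
    int8_t   qs[32 * 16];  // quantized weights, K-major: qs[k * 16 + n]
};

// Hypothetical packing helper: src is the original N x K (16 x 32) int8 block.
static void pack_tile(packed_tile_q8_0 &dst, const int8_t *src /* [16][32] */,
                      const uint16_t *scales /* 16 f16 bit patterns */) {
    for (int n = 0; n < 16; ++n) dst.d[n] = scales[n];
    for (int k = 0; k < 32; ++k)
        for (int n = 0; n < 16; ++n)
            dst.qs[k * 16 + n] = src[n * 32 + k];  // transpose N x K -> K x N
}
```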
@ggerganov On Azure, is it possible to use those instances for CI?
- add intel amx isa detection
- add vnni kernel for gemv cases
- add vnni and amx kernel support for block_q8_0
- code cleanup
- fix packing B issue
- enable openmp
- fine tune amx kernel
- switch to aten parallel pattern
- add error message for nested parallelism
- code cleanup
- add f16 support in ggml-amx
- add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS
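One concrete piece behind the "add intel amx isa detection" item above (a general requirement on Linux, not necessarily how this PR implements it): before executing any tile instruction, the process must ask the kernel to enable the XTILEDATA state component via arch_prctl.

```cpp
#include <sys/syscall.h>
#include <unistd.h>

// Request permission to use the AMX tile-data state on Linux. Without this,
// the first tile instruction faults even on AMX-capable hardware.
#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

static bool request_amx_permission() {
    return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) == 0;
}
```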
Thanks for letting me know - I just added an AMX VM (EC8eds v5) to the … It won't run on this PR since ggml-ci runs only on branches in this repository, so the AMX CI will run after we merge the PR into … I've also sent you a collaborator invite; if you'd like, you will be able to push branches in this repository and run the CI prior to merging in the future.
This PR improves Intel server CPU performance with Intel Advanced Matrix Extensions (AMX).
AMX is a new built-in accelerator for gemm starting from 4th gen Xeon: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html
The basic idea is pretty much the same as what I have done in PyTorch (pytorch/pytorch#117475) for the int4 and int8 mixed-dtype gemms.
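For readers unfamiliar with AMX, here is a minimal, illustrative sketch of the tile-intrinsic flow an AMX int8 gemm builds on (not the actual mmq.cpp kernels; blocking over larger matrices, quantization scales, the VNNI repacking of B, and the OS permission request are all omitted):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Compile with -mamx-tile -mamx-int8. Computes C (16x16 int32) +=
// A (16x64 int8) * B (64x16 int8, pre-packed 4-rows-interleaved per dword).
struct tile_config {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   // bytes per row of each tile register
    uint8_t  rows[16];    // number of rows of each tile register
};

static void gemm_16x16x64(const int8_t *A, const int8_t *B_packed, int32_t *C) {
    tile_config cfg;
    std::memset(&cfg, 0, sizeof(cfg));
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   // tmm0: C, 16 x 16 int32
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   // tmm1: A, 16 x 64 int8
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   // tmm2: B, 16 rows of 64 bytes (VNNI layout)
    _tile_loadconfig(&cfg);

    _tile_zero(0);
    _tile_loadd(1, A, 64);                 // stride in bytes
    _tile_loadd(2, B_packed, 64);
    _tile_dpbssd(0, 1, 2);                 // signed int8 x int8 -> int32 accumulate
    _tile_stored(0, C, 64);
    _tile_release();
}
```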
Features
- AMX and VNNI kernels for block_q8_0 (analogous to the __ARM_FEATURE_MATMUL_INT8 path); more support will be added in the future.
- Kernels are placed in ggml-amx.cpp since I don't want to mess up ggml.c, which is already very complex, and the AMX kernels could also become more complex in the future if more qformat support is added.
- … ggml-quant.c.

Performance
benchmark-matmult (metric: gFlops): …

TODO:
- add more quantized dtype support
- add bf16 gemm support with amx-bf16 (using avx512-bf16 for gemv)
- add f16 gemm support with amx-f16 (using avx512-f16 for gemv)

I also noticed from vtune that some pointwise operators need additional optimization, e.g. softmax. Will handle them later on.